-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add FileScanConfigBuilder
#15352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FileScanConfigBuilder
#15352
Conversation
FileScanConfig::new(object_store_url, self.schema(), source) | ||
FileScanConfigBuilder::new(object_store_url, self.schema(), source) | ||
.with_projection(projection.cloned()) | ||
.with_limit(limit); | ||
.with_limit(limit) | ||
.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mertak-synnada, hey, this PR is still WIP but I was wondering if you're happy with this approach. That's what we've discussed in #14685 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far, thank you! I haven't been able to test it thoroughly yet, but the legacy ParquetExecBuilder
could be helpful for understanding specific cases, just FYI.
FileScanConfigBuilder
and switch some casesFileScanConfigBuilder
also fyi @AdamGS 👀 |
|
||
// Finally, put it all together into a DataSourceExec | ||
Ok(file_scan_config.build()) | ||
Ok(Arc::new(DataSourceExec::new(Arc::new(file_scan_config)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it possible to have this return as a function? Is it because of import cycles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean as aDataSourceExec
function? It also looks a bit verbose to me, but the inner Arc is needed for dynamic dispatch, and the outer one makes the return type more explicit. Happy to make it DataSourceExec::new_arc
if you want, but i don't think we use that a lot in datafusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant just a cosmetic change to not re-writing the whole Arc::new(DataSourceExec::new(Arc::new(file_scan)))
Maybe it can be something like DataSourceExec::from_file_source(file_scan) -> Arc<DataSourceExec>
What would you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you like 748a335 ? Looks cleaner
my 2c - this looks great, I would love to also rename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @blaginin -- this looks great to me. Thank you !
I think it is really nice that it is backwards compatible as well (deprecated old APIs rather than removing)
I have one request for some more documentation, but I think this one is really good now
@@ -174,6 +175,219 @@ pub struct FileScanConfig { | |||
pub batch_size: Option<usize>, | |||
} | |||
|
|||
#[derive(Clone)] | |||
pub struct FileScanConfigBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add comments and examples here? Perhaps just pointing back to FileScanConfig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added docs for methods and example
@@ -326,14 +544,15 @@ impl FileScanConfig { | |||
/// # Parameters: | |||
/// * `object_store_url`: See [`Self::object_store_url`] | |||
/// * `file_schema`: See [`Self::file_schema`] | |||
#[allow(deprecated)] // `new` will be removed same time as `with_source` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is nice that it will deprecate the old API rather and will guide users during upgrade
|
||
// Finally, put it all together into a DataSourceExec | ||
Ok(file_scan_config.build()) | ||
Ok(Arc::new(DataSourceExec::new(Arc::new(file_scan_config)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant just a cosmetic change to not re-writing the whole Arc::new(DataSourceExec::new(Arc::new(file_scan)))
Maybe it can be something like DataSourceExec::from_file_source(file_scan) -> Arc<DataSourceExec>
What would you think?
@@ -645,6 +875,7 @@ impl FileScanConfig { | |||
|
|||
// TODO: This function should be moved into DataSourceExec once FileScanConfig moved out of datafusion/core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also remove this TODO with above suggestion I believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looking good to me
Thanks again @blaginin @mertak-synnada |
* WIP: Add `FileScanConfigBuilder` and switch some cases * Fmt * More fmt * Clean `FileScanConfig::build` * Clean `FileScanConfig::new` * Fix csv + fmt * More fixes * Remove pub * Remove todo * Add usage example * Fix input type * Add `from_data_source` * Fmt * Add docs for `with_source`
Related to #14685 (comment)
Rationale for this change
FileScanConfig
now violates single responsibility from SOLID. It serves two conflicting roles:As a builder, though this should be changed as discussed in
datafusion/datafusion/datasource/src/file_scan_config.rs
Line 631 in 635e73b
As a business logic provider (e.g.,
fn project
,impl DataSource
, etc.)These conflicting roles lead to issues like #14905 and #14679, where provider features are accessed even before the build process is complete.
What changes are included in this PR?
I've added
FileScanConfigBuilder
and deprecated builder approach forFileScanConfig
Are these changes tested?
Yes, updated exiting tests into the new interface
Are there any user-facing changes?
Yes, new builder interface - but the switch is quite easy (all builder-methods from
FileScanConfig
are supported)